Getting started

Download the Spotify data and Spotify starter R script from the course website and save them in the folder that you have been using for this class.

Open the R file and set R’s working directory to the folder that this file is in. (You can do that through the menu bar at the top of the screen: Session > Set Working Directory > To Source File Location.) Run the code in the R script and try to understand what the code is doing.

Testing the difference in tempo between songs of different mode

Each song comes in a particular key, and keys are grouped into “Major” keys and “Minor” keys. Major keys are generally happy-sounding, while minor keys sound more melancholic. Let’s say we want to test if the difference in tempo (speed of the song) between songs in major and minor keys. One way to do this is to test if the mean tempo of songs in major keys is different from that for songs in minor keys. The code below computes the tempo means for each mode:

df %>% group_by(mode) %>%
    summarize(mean_tempo = mean(tempo))
## # A tibble: 2 x 2
##   mode  mean_tempo
##   <fct>      <dbl>
## 1 Minor       116.
## 2 Major       122.

We see that songs in minor key have mean tempo of ~116 bpm, while songs in major key have a slightly faster mean tempo of ~122 bpm. Is this difference significant? To test for significance, one option we have is to use the two-sided \(t\)-test. (Whether the \(t\)-test is appropriate is a question left for a statistical methods class.)

The code for performing a two-sided \(t\)-test is below:

major_data <- (df %>% filter(mode == "Major"))$tempo
minor_data <- (df %>% filter(mode == "Minor"))$tempo
t.test(major_data, minor_data, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  major_data and minor_data
## t = 1.0377, df = 97.475, p-value = 0.302
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.152959 16.446581
## sample estimates:
## mean of x mean of y 
##  121.5741  115.9273

From the readout, the \(p\)-value for this test is around 0.30, which is fairly large. We wouldn’t reject the null hypothesis in favor of the alternative hypothesis.

The readout also gives us a confidence interval: this is an interval which will capture the true value of the difference 95% of the time (if we were to repeat this procedure over and over again).

The \(t\)-test makes some strong assumptions on how the data was generated. (We call it a parametric test). If we don’t want to make assumptions on how the data was generated, we can use a non-parametric test, such as the Kolmogorov-Smirnov test, which tests whether the distribution of 2 variables is the same or not:

ks.test(major_data, minor_data, alternative = "two.sided")
## Warning in ks.test(major_data, minor_data, alternative = "two.sided"):
## cannot compute exact p-value with ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  major_data and minor_data
## D = 0.13136, p-value = 0.7946
## alternative hypothesis: two-sided

Relationship between valence and loudness

Next, let’s perform linear regression on valence vs. loudness. We expect increasing loudness to result in increasing valence. Linear regression is easily done in R with the lm function:

lm(valence ~ loudness, data = df)
## 
## Call:
## lm(formula = valence ~ loudness, data = df)
## 
## Coefficients:
## (Intercept)     loudness  
##     0.79386      0.04897

The line above finds the coefficients \(a_1\) and \(a_2\) such that the line \(valence = a_1 + a_2 \cdot loudness\) fits the data best. From the output, we can see that the line of best fit is \(score = 0.79386 + 0.04897 \cdot alc\). The coefficients mean that for every unit increase in the loudness scale, valence increases correspondingly by 0.04897. This matches our expectations.

The output of the lm function doesn’t give us any information other than the coefficients. We can use the summary function to get more information:

fit <- lm(valence ~ loudness, data = df)
summary(fit)
## 
## Call:
## lm(formula = valence ~ loudness, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39188 -0.12018 -0.00328  0.14860  0.40816 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.79386    0.06570   12.08  < 2e-16 ***
## loudness     0.04897    0.01108    4.42 2.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1986 on 98 degrees of freedom
## Multiple R-squared:  0.1662, Adjusted R-squared:  0.1577 
## F-statistic: 19.54 on 1 and 98 DF,  p-value: 2.548e-05

Residuals refer to the differences between the actual score, and the score predicted by the linear regression. This is something that we look at to check if the linear regression model makes sense.

Instead of just the value of the coefficients, we have more information in a table. Look at the p-values in the last column. The p-value for loudness is the result of testing the null hypothesis coefficient for loudness \(= 0\) (i.e. loudness is uncorrelated with valence) vs. the alternative hypothesis coefficient for loudness \(\neq 0\) (i.e. loudness is correlated with valence). The p-value is very small, and so the data is not consistent with the null hypothesis (i.e. no relationship). We would conclude that there is a statistically signifcant relationship between loudness and valence.

We can plot the linear fit on a scatterplot using geom_smooth and specify the method as “lm”:

ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
    geom_point() +
    geom_smooth(method = "lm")

Modeling valence as a function of loudness and mode

It is likely that valence is influenced by more than 1 variable. We can fit more complex linear models with the lm function as well. The code below fits the additive model \(valence = a_1 + a_2 \cdot loudness + a_3 \cdot modeMajor\):

fit <- lm(valence ~ loudness + mode, data = df)
summary(fit)
## 
## Call:
## lm(formula = valence ~ loudness + mode, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3945 -0.1170 -0.0027  0.1498  0.4119 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.79121    0.06839  11.569  < 2e-16 ***
## loudness     0.04912    0.01118   4.394 2.85e-05 ***
## modeMajor    0.00605    0.04061   0.149    0.882    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1996 on 97 degrees of freedom
## Multiple R-squared:  0.1664, Adjusted R-squared:  0.1492 
## F-statistic: 9.684 on 2 and 97 DF,  p-value: 0.0001464

When faced with a categorical variable, R chooses a baseline category (in this case, “Minor”). It then creates a series of other “dummy” variables for all the other categories. For a particular row, the value of the dummy variable is 1 if the row belongs to that category, 0 otherwise. In this example, the only other category for mode is “Major”, so R creates a dummy variable modeMajor which has value 1 if the row belongs to a song in the major key, 0 otherwise.

The issue with the additive model is that it constrains the gradient of the fit to be the same for both males and females (see slides). To fix that, we can fit a model with interactions effects instead:

fit <- lm(valence ~ loudness * mode, data = df)
summary(fit)
## 
## Call:
## lm(formula = valence ~ loudness * mode, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39179 -0.12289 -0.00013  0.14712  0.42106 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.82557    0.09463   8.724 8.16e-14 ***
## loudness            0.05541    0.01638   3.384  0.00104 ** 
## modeMajor          -0.06058    0.13270  -0.457  0.64905    
## loudness:modeMajor -0.01186    0.02248  -0.528  0.59898    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2004 on 96 degrees of freedom
## Multiple R-squared:  0.1688, Adjusted R-squared:  0.1429 
## F-statistic: 6.501 on 3 and 96 DF,  p-value: 0.0004735

With this code, R fits the model \(valence = a_1 + a_2 \cdot loudness + a_3 \cdot modeMajor + \color{blue}{a_4 \cdot loudness \cdot modeMajor}\), where \(modeMajor = 1\) if the row has mode “Major”, and \(modeMajor = 0\) otherwise.

We can plot the linear fit on scatterplot:

ggplot(data = df, mapping = aes(x = loudness, y = valence, col = mode)) + 
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    facet_wrap(~ mode)

The Spotify final.Rmd file has a fair amount of optional material that goes deeper in some other aspects of modeling you could do in R.

Optional material

Getting predictions from the model

How can we get predictions from the model? We can use the predict function, with the first argument being the model (i.e. output of lm), and the second argument being the data on which to predict. (The data must be in the exact same form as the data the model was trained on, columns and all.)

This code gives the predictions on the training dataset:

fit <- lm(valence ~ loudness, data = df)
predict(fit, df)
##         1         2         3         4         5         6         7 
## 0.6379884 0.5819175 0.5609092 0.4623810 0.5840722 0.4588062 0.4708529 
##         8         9        10        11        12        13        14 
## 0.5469037 0.5509193 0.3837838 0.4821161 0.4790799 0.5478342 0.5768246 
##        15        16        17        18        19        20        21 
## 0.3631673 0.5874511 0.6047376 0.5554735 0.5946497 0.5830438 0.5531719 
##        22        23        24        25        26        27        28 
## 0.5613010 0.4315787 0.5962658 0.4840259 0.4577289 0.6286351 0.4125783 
##        29        30        31        32        33        34        35 
## 0.5196763 0.5695280 0.5029774 0.4496977 0.6423468 0.6323079 0.4874048 
##        36        37        38        39        40        41        42 
## 0.4151737 0.2367261 0.4709508 0.5198721 0.5085110 0.4847605 0.5946497 
##        43        44        45        46        47        48        49 
## 0.6418081 0.5877449 0.5733966 0.4635563 0.2325636 0.6508186 0.5336818 
##        50        51        52        53        54        55        56 
## 0.5821623 0.6765280 0.3319243 0.5338776 0.5489115 0.5056218 0.6324058 
##        57        58        59        60        61        62        63 
## 0.5656104 0.4548396 0.2845210 0.5841701 0.4626749 0.5228104 0.4973948 
##        64        65        66        67        68        69        70 
## 0.5312332 0.5272177 0.4919101 0.3236972 0.4668373 0.6380864 0.5148282 
##        71        72        73        74        75        76        77 
## 0.5671284 0.6288310 0.4828506 0.4213440 0.4859357 0.6423957 0.3884359 
##        78        79        80        81        82        83        84 
## 0.6217303 0.4815284 0.5428392 0.5881857 0.6433751 0.5726131 0.5057687 
##        85        86        87        88        89        90        91 
## 0.5280012 0.4206584 0.4884332 0.5445042 0.4801572 0.5442104 0.4097380 
##        92        93        94        95        96        97        98 
## 0.5934255 0.5237408 0.5318699 0.3909334 0.5607133 0.5171298 0.4400016 
##        99       100 
## 0.5538085 0.4709998

Let’s plot these predictions on the scatterplot to make sure we got it right:

df$predictions <- predict(fit, df)
ggplot(data = df) +
    geom_point(aes(x = loudness, y = valence)) +
    geom_smooth(aes(x = loudness, y = valence), method = "lm", se = FALSE) +
    geom_point(aes(x = loudness, y = predictions), col = "red")

Here are the predictionrs on a randomly generated set of data:

new_df <- data.frame(loudness = c(-10, -3.6, -3.74, -8, -5.22))
predict(fit, new_df)
##         1         2         3         4         5 
## 0.3041581 0.6175678 0.6107120 0.4020986 0.5382360

More information from the model

If you plot the output of the lm call, you will get a series of plots which give you more information on the fit:

fit <- lm(valence ~ loudness, data = df)
par(mfrow = c(2,2))
plot(fit)

Confidence intervals

The following code gives the mean tempo for all the songs:

mean(df$tempo)
## [1] 119.2025

We can use the following code to get an \(x\)% confidence interval for the mean tempo:

x <- 0.9
t.test(df$tempo, conf.level = x)
## 
##  One Sample t-test
## 
## data:  df$tempo
## t = 42.644, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  114.5612 123.8437
## sample estimates:
## mean of x 
##  119.2025

To get confidence intervals for parameters in a linear model, we can use the confint function (if level = x in the confint call, it will give the endpoints of the x% confidence interval):

x <- 0.95
fit <- lm(valence ~ loudness, data = df)
confint(fit, level = x)
##                  2.5 %     97.5 %
## (Intercept) 0.66349030 0.92423126
## loudness    0.02698615 0.07095438

Plotting song labels

It might be informative to have the song names in the scatterplot, especially when we want to identify outliers. The following plot is pretty crowded, so putting all the song names might not be a good idea. In any case, here is how you can do it (play around with the different options in geom_text to see what they do):

ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
    geom_point() + 
    geom_text(aes(label = name, col = mode),
              hjust = 0, nudge_x = 0.1, angle = 45, size = 3) +
    facet_grid(. ~ mode)

It’s a little bit of a mess. We can do slightly better by loading the ggrepel package, and replacing geom_text with geom_text_repel (and some change of options):

library(ggrepel)
ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
    geom_point() + 
    geom_text_repel(aes(label = name, col = mode), size = 3) +
    facet_grid(. ~ mode)

Drawing the linear fit for an additive model

I couldn’t find an easy way to draw the linear fits for an additive model in the facetted scatterplot. (If you find a way, let me know!)

This is a workaround:

# get fit coefficients
fit <- lm(valence ~ loudness + mode, data = df)
coefs <- fit$coefficients

# add coefficients to dataset
df$slope <- coefs[2]
df$intercept <- ifelse(df$mode == "Minor", coefs[1], coefs[1] + coefs[3])

# plot
ggplot(data = df, mapping = aes(x = loudness, y = valence, col = mode)) + 
    geom_point() +
    geom_abline(aes(slope = slope, intercept = intercept, col = mode)) +
    facet_wrap(~ mode)

Using plotly for interactive figures

The plotly package is great for making interactive figures with simple syntax. For example, here is the plot of valence vs. loudness, with the color denoting the mode of the song:

library(plotly)
plot_ly(data = df, x = ~loudness, y = ~valence, color = ~mode)

If you hover the cursor over each observation, you will see that it gives the values of loudness, valence and mode for this point. The code below changes the mouseover information to the song title:

plot_ly(data = df, x = ~loudness, y = ~valence, color = ~mode, 
        hoverinfo = "text", text = ~name)

(To keep the original information that the data labels had, remove the hoverinfo option from the function call above.) More documentation on the plotly package can be found at https://plot.ly/r/.

Session info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plotly_4.9.0    ggrepel_0.8.1   forcats_0.4.0   stringr_1.4.0  
##  [5] dplyr_0.8.3     purrr_0.3.2     readr_1.3.1     tidyr_0.8.3    
##  [9] tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5   xfun_0.9           reshape2_1.4.3    
##  [4] haven_2.1.1        lattice_0.20-38    colorspace_1.4-1  
##  [7] generics_0.0.2     vctrs_0.2.0        viridisLite_0.3.0 
## [10] htmltools_0.3.6    yaml_2.2.0         utf8_1.1.4        
## [13] rlang_0.4.0        later_0.8.0        pillar_1.4.2      
## [16] glue_1.3.1         withr_2.1.2        RColorBrewer_1.1-2
## [19] modelr_0.1.5       readxl_1.3.1       plyr_1.8.4        
## [22] munsell_0.5.0      gtable_0.3.0       cellranger_1.1.0  
## [25] rvest_0.3.4        htmlwidgets_1.3    evaluate_0.14     
## [28] labeling_0.3       knitr_1.24         httpuv_1.5.2      
## [31] crosstalk_1.0.0    fansi_0.4.0        broom_0.5.2       
## [34] Rcpp_1.0.2         xtable_1.8-4       promises_1.0.1    
## [37] scales_1.0.0       backports_1.1.4    jsonlite_1.6      
## [40] mime_0.7           hms_0.5.1          digest_0.6.20     
## [43] stringi_1.4.3      shiny_1.3.2        grid_3.6.1        
## [46] cli_1.1.0          tools_3.6.1        magrittr_1.5      
## [49] lazyeval_0.2.2     crayon_1.3.4       pkgconfig_2.0.2   
## [52] zeallot_0.1.0      data.table_1.12.2  xml2_1.2.2        
## [55] lubridate_1.7.4    assertthat_0.2.1   rmarkdown_1.15    
## [58] httr_1.4.1         rstudioapi_0.10    R6_2.4.0          
## [61] nlme_3.1-140       compiler_3.6.1